Neuromorphic engineering combines the architectural and computational
principles of systems neuroscience with semiconductor electronics, with the
aim of building efficient and compact devices that mimic the synaptic and
neural machinery of the brain. Neuromorphic engineering promises extremely
low energy consumption, comparable to that of the nervous system.
However, until now the neuromorphic approach has been restricted to
relatively simple circuits and specialized functions, rendering elusive a direct
comparison of their energy consumption to that used by conventional von
Neumann digital machines solving real-world tasks. Here we show that a recent
technology developed by IBM can be leveraged to realize neuromorphic
circuits that operate as classifiers of complex real-world stimuli. These
circuits emulate enough neurons to compete with state-of-the-art classifiers.
We also show that the energy consumption of the IBM chip is typically 2 or
more orders of magnitude lower than that of conventional digital machines
when implementing classifiers with comparable performance. Moreover,
the spike-based dynamics display a trade-off between integration time and
accuracy, which naturally translates into algorithms that can be flexibly
deployed for either fast and approximate classifications, or more accurate
classifications at the mere expense of longer running times and higher
energy costs. This work finally proves that the neuromorphic approach can be
efficiently used in real-world applications and that it has significant
advantages over conventional digital devices when energy consumption is
considered.

neuromorphic electronic hardware | VLSI technology | neural networks | classification

Abbreviations: SVM: support vector machine; SV: support vector; RCN: randomly
connected neuron

Introduction

Recent developments in digital technology and machine learning are enabling
computers to perform an increasing number of tasks that were once solely the
domain of human expertise, such as recognizing a face in a picture or driving
a car in city traffic. These are impressive achievements, but we should keep
in mind that the human brain carries out tasks of such complexity using only
a small fraction of the energy needed by conventional computers, the
difference in energy consumption often being several orders of magnitude.
This suggests that one way to reduce energy consumption is to design machines
whose architecture takes inspiration from the biological brain, an approach
that was proposed by Carver Mead in the late 1980s [1] and that is now known
as "neuromorphic engineering". Mead's idea was to use very-large-scale
integration (VLSI) technology to build electronic circuits that mimic the
architecture of the nervous system. The first electronic devices inspired by
this concept were analog circuits that exploited the subthreshold properties
of transistors to emulate the biophysics of real neurons. Nowadays the term
"neuromorphic" refers to any analog, digital, or hybrid VLSI system whose
design principles are inspired by those of biological neural systems [5].

Neuromorphic hardware has convincingly demonstrated its potential for energy
efficiency, as proven by devices that consume as little as a few picojoules
per neural event (spike). These devices, however, contain a relatively small
number of elements (neurons and synapses) and they can typically perform only
simple and specialized tasks, making it difficult to directly compare their
energy consumption to that of conventional digital machines.

The situation has changed recently with the development by IBM of the
TrueNorth processor, a neuromorphic device that implements enough artificial
neurons to perform complex real-world tasks, like large-scale pattern
classification [6]. Here we show that a pattern classifier implemented on the
IBM chip can achieve performances comparable to those of state-of-the-art
conventional devices based on the von Neumann architecture. More importantly,
our chip-implemented classifier uses 2 or more orders of magnitude less energy
than current digital machines performing the same classification tasks. These
results show for the first time the deployment of a neuromorphic device able
to solve a complex task, while meeting the claims of energy efficiency
contended by the neuromorphic engineering community for the last few decades.


Results 

We chose pattern classification as an example of a complex task because of the
availability of well-established benchmarks. A classifier takes an input, like
the image of a handwritten character, and assigns it to one among a set of
discrete classes, like the set of digits. To train and evaluate our
classifiers we used three different datasets consisting of images of different
complexity (see Fig. 1a).

We start by describing the architecture of the classifier that we plan to
implement on the neuromorphic chip. The classifier is a feed-forward neural
network with three layers of neurons, and it can be simulated on a traditional
digital computer. We will call this network the 'neural classifier' to
distinguish it from its final chip implementation, which requires adapting the
architecture to the connectivity constraints imposed by the hardware. The
neural classifier also differs from the final hardware implementation in that
it employs neurons with a continuous activation function, whereas the IBM
neuromorphic chip emulates spiking neurons. Despite the differences, the
functionality of the neural classifier and its final chip implementation is
approximately the same, as we show below. We list the procedure for adapting
the architecture of the neural classifier into its chip implementation as a
contribution in its own right, since it can be directly extended for the
implementation of generic neural systems on other hardware substrates.

Figure 1. Datasets, architecture of the classifier, architecture of a single
core, and chip implementation. a Samples of the three datasets used to
evaluate the performance of our classifier. MNIST contains handwritten digits
(10 classes); MNIST-back-image contains the digits of the MNIST dataset on a
background patch extracted randomly from a set of 20 images downloaded from
the Internet; LaTeX contains distorted versions of 293 characters used in the
LaTeX document preparation system. For more details about the datasets, see
Methods. b Architecture of the neural network classifier. The images to
classify are preprocessed (see Methods) and represented as patterns of
activity of a population of input neurons (left, black dots). These input
neurons send random projections to a second layer of N Randomly Connected
Neurons (RCNs) (green circles), which nonlinearly transform their synaptic
inputs into firing activity. The activity of the RCNs is then read out by a
third layer of neurons, each of which is trained to respond to only one class
(red circles). c Architecture of a single core in the chip. Horizontal lines
represent inputs, provided by the axons of neurons that project to the core.
Vertical lines represent the dendrites of the neurons in the core (one
dendrite per neuron). Active synapses are shown as dots in a particular
axon-dendrite junction. The synaptic input collected by the dendrites is
integrated and transduced into spike activity at the soma (filled squares on
top). The spikes emitted by the neuron are sent via its axon to a particular
input line, not necessarily on the same core. Blue lines represent the flow of
input and output signals. The panel includes an example of internal
connection: the upmost axon carries the output activity of the leftmost neuron
in the core (other connections are left unspecified). d Implementation of the
neural network classifier in a chip with connectivity constraints. The input
is fed into all the cores in the RCN layer (shaded blue), whose neurons
project to the input lines of readout cores (shaded yellow) in a one-to-one
manner (green curves). The outputs of the readout units are combined together
off-line to generate the response of the output neuron (shaded red). See the
main text for the description of the different modules.

Architecture of the neural classifier. Figure 1b illustrates the three-layer
neural classifier. The first layer encodes the preprocessed input and projects
to the neurons in the intermediate layer through connections with random
weights. Each of these Randomly Connected Neurons (RCNs) therefore receives a
synaptic current given by a randomly weighted sum of the inputs, which the
RCNs transform into activation levels in a non-linear way; in our case,
through a linear rectification function: f(x) = x if x > 0, and 0 otherwise.
The combination of a random mixing of the inputs together with a non-linear
input-output transformation efficiently expands the dimensionality of the
resulting signal (see e.g. [3, 8, 9]), thereby increasing the chances that
downstream neurons can discriminate signals belonging to distinct classes.
This discrimination is carried out by a set of output units in the last layer,
which compute a weighted sum of the RCN activity. The weights are trained so
that each output unit responds to one separate class (one-vs-all code).
Details are given in the Methods. Once the network is trained, a class is
assigned to each input pattern according to which output unit exhibits the
highest activation.
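
As an illustration of the three-layer architecture described above, the
following minimal Python sketch computes the RCN activations and the
one-vs-all decision. The function names, the Gaussian distribution of the
random weights, and the seed are assumptions made for the example; they are
not taken from the original implementation, and the readout weights are
assumed to have been trained elsewhere.

```python
import random


def relu(x):
    """Linear rectification: f(x) = x if x > 0, and 0 otherwise."""
    return x if x > 0 else 0.0


def make_random_weights(n_rcn, n_in, seed=0):
    """Random input-to-RCN weights (Gaussian here, which is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_rcn)]


def rcn_layer(x, w_random):
    """Each RCN rectifies a randomly weighted sum of the inputs."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_random]


def classify(x, w_random, w_readout):
    """Return the index of the output unit with the highest activation."""
    h = rcn_layer(x, w_random)
    outputs = [sum(w * hi for w, hi in zip(row, h)) for row in w_readout]
    return max(range(len(outputs)), key=outputs.__getitem__)
```

Here `w_readout` holds one row of trained weights per class, so the returned
index plays the role of the class label.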

Chip implementation of the neural classifier. We implemented the neural
classifier on the IBM neuromorphic chip described in [10, 6]. The first step
of the conversion of the abstract neural classifier to an explicit chip
implementation is the transformation of the input patterns into a format that
is compatible with the spike-based coding of the TrueNorth system. For this we
simply employ a firing rate coding and convert the integer value of every
input component to a spike train with a proportional number of spikes, a
prescription that is commonly used in neurocomputational models such as the
Neural Engineering Framework. Specifically, input patterns are preprocessed
and formatted into 256-dimensional vectors representing the firing activity of
the input layer (the same preprocessing step was applied in the neural
classifier, see Methods). This vector of activities is then used to generate
256 regular-firing spike trains that are fed into a set of cores with random
and sparse connectivity. This set of cores constitutes the RCN layer. As in
the neural classifier, the neurons in the RCN layer receive synaptic inputs
that consist of randomly weighted combinations of the input, and transform
their synaptic inputs into firing activity according to a nonlinear function.
On the chip this function is given by the neuronal current-to-rate
transduction, which approximates a linear-rectification function.
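
The rate-coding step can be sketched as follows. This is only an illustrative
reading of "a spike train with a proportional number of spikes": the
discretization into `n_ticks` time steps, the cap `max_spikes`, and the
function names are hypothetical parameters introduced for the example.

```python
def regular_spike_train(value, max_value, n_ticks, max_spikes):
    """Encode an integer input as a regular (evenly spaced) spike train.

    The number of spikes is proportional to `value`, and the spikes are
    spread evenly over `n_ticks` steps with at most one spike per step.
    """
    if not 0 <= value <= max_value:
        raise ValueError("value must lie in [0, max_value]")
    if not 0 <= max_spikes <= n_ticks:
        raise ValueError("max_spikes must lie in [0, n_ticks]")
    n_spikes = (value * max_spikes) // max_value
    # Step t carries a spike whenever floor(t * n_spikes / n_ticks) increases.
    return [((t + 1) * n_spikes) // n_ticks - (t * n_spikes) // n_ticks
            for t in range(n_ticks)]


def encode_pattern(pattern, max_value, n_ticks, max_spikes):
    """One regular spike train per input component (256 in the text)."""
    return [regular_spike_train(v, max_value, n_ticks, max_spikes)
            for v in pattern]
```

Because the per-step differences telescope, each train contains exactly
`n_spikes` spikes, so the spike count stays proportional to the input value.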
Discriminating the inputs coming from the RCN layer requires each output unit
to read from the whole layer of RCNs, which in our implementation contains a
number of neurons N that can be as large as 2^14. Moreover, all the readout
connections have to be set at the weights computed by the training procedure.
These requirements exceed the constraints set by the chip design, in terms of
the maximal number of both incoming and outgoing connections per neuron, as
well as the resolution and the freedom with which synaptic weights can be set.
In what follows we present a set of prescriptions that allow us to circumvent
these limitations, and successfully instantiate our neural classifier on the
IBM system. The prescriptions we are presenting are specific to the TrueNorth
architecture, but the types of constraints that they solve are shared by any
physical implementation of neural systems, whether it is biological or
electronic. It is therefore instructive to discuss the constraints and the
prescriptions to obviate them in detail, as they can be easily extended to
other more generic settings.

1. Constraints on connectivity. The IBM chip is organized in cores, each of
which contains 256 integrate-and-fire neurons and 256 input lines that
intersect with one another forming a crossbar matrix of programmable synapses
(Fig. 1c). Each neuron can connect to other neurons by projecting its axon
(output) to a single input line, either on the same core or on a different
core. With this hardware design the maximum number of incoming connections per
neuron, or fan in, is 256. Likewise, the maximal number of outgoing
connections per neuron, or fan out, is 256, all of which are restricted to
target neurons within a single core.

2. Constraints on synaptic weight precision. Synapses can be either inactive
or active. The weight of an active synapse can be selected from a set of four
values given by signed integers of 9-bit precision. These values can differ
from neuron to neuron. Which of the four values is assigned to an active
synapse depends on the input line: all synapses on the same input line are
assigned an index that determines which of the four values is taken by each
synapse (e.g. if the index assigned to the input line is 2, all synapses on
the input line take the second value of the set of four available synaptic
weights, which may differ from neuron to neuron).
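
The two constraints can be summarized in a small model of one core's crossbar.
The sketch below is a simplified illustration, not IBM's programming
interface: the data layout and the names `connected`, `axon_type`, and
`neuron_weights` are assumptions, and indices are 0-based here whereas the
text counts the four values from 1.

```python
def effective_weight(axon, neuron, connected, axon_type, neuron_weights):
    """Weight seen by `neuron` from input line `axon`.

    connected[axon][neuron] -- 1 if the crossbar synapse is active, else 0
    axon_type[axon]         -- index (0..3) shared by all synapses on the line
    neuron_weights[neuron]  -- the four signed integer values of that neuron
    """
    if not connected[axon][neuron]:
        return 0
    return neuron_weights[neuron][axon_type[axon]]


def synaptic_input(spikes, neuron, connected, axon_type, neuron_weights):
    """Total input to `neuron` for one vector of 0/1 spikes on the input lines."""
    return sum(
        s * effective_weight(a, neuron, connected, axon_type, neuron_weights)
        for a, s in enumerate(spikes)
    )
```

The point of the model is that a synapse cannot be given an arbitrary weight:
the input line fixes which of the four slots is used, and the receiving neuron
fixes what value sits in that slot.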

The design constraints that we just described can be overcome with the
following set of architectural prescriptions.

P1. Overcoming the constraints on connectivity. We introduced an intermediate
layer of neurons, each of which integrates the inputs from 256 out of the
total N RCNs. Accordingly, the firing rates of these intermediate neurons
represent a 256/N portion of the total input to an output unit. These partial
inputs can then be combined by a downstream neuron, which will have the same
activity as the original output unit. If the total number of the partial
inputs is larger than the total number of incoming connections of the neurons
that represent the output units (in our case 256), the procedure can be
iterated by introducing additional intermediate layers. The final tree will
contain a number of layers that scales only logarithmically with the total
number of RCNs. For simplicity we did not implement this tree on chip and we
summed off-chip the partial inputs represented by the firing activity of the
readout neurons. Notice also that this configuration requires readout neurons
to respond approximately linearly to their inputs, which can be easily
achieved by tuning readout neurons to operate in the linear regime of their
current-to-rate transduction function (i.e., the regime in which their average
input current is positive). This procedure strongly relies on the assumption
that information is encoded in the firing rates of neurons; if the spiking
inputs happen to be highly synchronized and synchronization encodes important
information, this approach would not work.
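
The counting argument behind prescription P1 can be made concrete with a short
sketch. It idealizes the intermediate neurons as exactly linear units, which
the text only requires approximately, and the function names are hypothetical.

```python
def partial_sums(values, weights, fan_in=256):
    """Weighted sums over consecutive groups of at most `fan_in` inputs."""
    return [
        sum(w * v for w, v in zip(weights[i:i + fan_in], values[i:i + fan_in]))
        for i in range(0, len(values), fan_in)
    ]


def tree_depth(n_inputs, fan_in=256):
    """Number of summation layers needed to reduce n_inputs to one output."""
    depth = 0
    while n_inputs > 1:
        n_inputs = -(-n_inputs // fan_in)  # ceiling division
        depth += 1
    return depth
```

With N = 16384 RCNs and a fan in of 256, the first layer produces 64 partial
inputs, which a single downstream unit can combine, so `tree_depth(16384)` is
2. Summing the output of `partial_sums` reproduces the full weighted sum, which
is the operation performed off-chip in the implementation described above.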

P2. Overcoming the constraints on synaptic weight precision. Reducing the
weight precision after learning usually only causes moderate drops in
classification performance. For example, in the case of random uncorrelated
inputs, the scaling properties of the capacity of the classifier (i.e., number
of classes that can be correctly classified) remain unchanged, even when the
number of states of the synaptic weights is reduced to two. Instead, the
performance drop is catastrophically larger when the weight precision is
limited also during learning, and in some situations the learning problem
becomes NP-complete [16]. In our case the readout weights are determined
off-chip, using conventional digital computers that operate on 64-bit numbers,
and then quantized in the chip implementation. The performance drop is almost
negligible for a sufficient number of synaptic levels. In our case we
quantized the readout weights of the original classifier on an integer scale
between -28 and 28. Each quantized weight was then implemented as the sum of
four groups of 6 synaptic contacts, where each contact in the group can either
be inactive (value 0) or activated at one of the 6 values: ±1, ±2, ±4. The
multiplicity of this decomposition (19 can be for instance decomposed as
(4) + (1 + 4) + (1 + 4) + (1 + 4) or (2) + (2 + 4) + (2 + 4) + (1 + 4)) is
resolved by choosing the decomposition that is closest to a balanced
assignment of the weights across the 4 groups (e.g.
19 = (4) + (1 + 4) + (1 + 4) + (1 + 4)). This strategy requires that each
original synapse be represented by 24 synapses. We implemented this strategy
by replicating each readout neuron 24 times and by distributing each original
weight across 24 different dendritic trees. These synaptic inputs are then
summed together by the off-line summation of all readout neuron activities
that correspond to the partial inputs to a specific output unit (see Methods
for details). A similar strategy can be used to implement networks with
synaptic weights that have an even larger number of levels, and the number of
additional synapses would scale only logarithmically with the total number of
synaptic levels that is required. However, it is crucial to limit individual
synapses to low values, in order to avoid synchronization between neurons.
This is why we limited the maximum value of individual synapses on the chip
to 4.
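
A possible way to compute such a decomposition is sketched below. It
interprets "balanced" as group totals that differ by at most one, which
reproduces the example given for 19; this interpretation and the function name
are assumptions, and the authors' exact rule may differ.

```python
def decompose_weight(w, n_groups=4, values=(1, 2, 4)):
    """Split an integer weight into `n_groups` groups of contacts.

    Each group uses every magnitude in `values` at most once, all with the
    sign of `w`, so a group total lies between -7 and 7 and |w| <= 28.
    The bit test below relies on `values` being powers of two.
    """
    max_total = n_groups * sum(values)
    if abs(w) > max_total:
        raise ValueError(f"|w| must not exceed {max_total}")
    sign = 1 if w >= 0 else -1
    base, remainder = divmod(abs(w), n_groups)
    groups = []
    for g in range(n_groups):
        magnitude = base + (1 if g < remainder else 0)
        # Binary expansion of the group total over the contacts 1, 2, 4.
        groups.append([sign * v for v in values if magnitude & v])
    return groups
```

For instance, `decompose_weight(19)` yields the groups (1 + 4), (1 + 4),
(1 + 4), (4), which is the balanced decomposition quoted in the text up to the
order of the groups.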

Figure 2. The neuromorphic classifier in action. a Spikes emitted by readout
neurons during an easy (top) and a difficult (bottom) classification, after
removing the trend caused by the intrinsic constant currents. Each curve
corresponds to the readout output associated with the digit indicated by the
color code. Samples are drawn from the MNIST-back-image dataset. b Test error
as a function of classification time (i.e., the time over which spikes are
integrated) and energy. The error is averaged over the first 1000 test samples
of the MNIST (red) and MNIST-back-image (blue) datasets. Each dashed
horizontal line indicates the best test error achieved with support vector
classifiers for a given dataset, based on the evaluation of the whole test
set. c Classification times for different thresholds in spike difference (as
indicated in the legend), for the MNIST and MNIST-back-image datasets. For
each threshold we plot all classification times (thin lines) as well as the
sample mean (shorter ticks on top). The performances associated with each
threshold are indicated on the y-axis. When the threshold in spike difference
is infinite (black), the classification is assessed at t = 500 ms (i.e., there
is no stopping criterion). In all panels the chip uses N = 16384 RCNs.

Classification performance and speed-accuracy trade-off. Our neuromorphic
classifier implemented on the TrueNorth chip was emulated on a simulator
developed by IBM. As the TrueNorth chip is entirely digital, the simulator
reproduces exactly the behavior of the chip [10]. In Fig. 2a we show the
dynamics of two typical runs of the simulator classifying images from the
MNIST-back-image dataset. Upon image presentation, the RCNs in the
intermediate layer start integrating the input signal (not shown) and, a few
tens of milliseconds later, they start emitting spikes, which are passed to
the readout neurons. The figure shows the total number of spikes emitted by
the readout neurons since input activation, after subtracting the overall
activity trend caused by baseline activity.

For simple classifications, in which the input is easily recognizable, the
readout neuron associated with the correct class is activated in less than
100 ms (Fig. 2a, top). More difficult cases require the integration of spikes
over longer time intervals, as the average synaptic inputs to different
readout neurons can be very similar (Fig. 2a, bottom). This suggests that the
performance of the classifier, as measured by the classification error rate on
the test set, should improve with longer integration intervals. This trade-off
between speed and performance is illustrated in Fig. 2b, which shows the
classification performance versus elapsed time for the MNIST and
MNIST-back-image datasets. The performance increases monotonically with time
until it saturates in about half a second, with a highest performance of
97.27% for MNIST (98.2% with 10-fold bagging), and 77.30% for
MNIST-back-image. These performances are not too far from the best
classification results achieved so far: 99.06% for MNIST (using maxout
networks on the permutation invariant version of the MNIST dataset, which does
not exploit any prior knowledge about the two-dimensional structure of the
patterns [n]) and 77.39% for MNIST-back-image (with support vector classifiers
[18], although methods combining deep nets, feature learning, and feature
selection can achieve performances as high as 87.75% [H]).

Energy-speed-accuracy trade-off. As just discussed, accuracy has a cost in
terms of energy because longer integration times entail more emitted spikes
per classification and larger baseline energy costs, which in our case are the
dominant contribution to the total energy consumption. We estimated the energy
consumption as described in the Methods and we found that the energy per
classification never exceeds 1 mJ for our network configuration. With the
energy needed to keep a 100 W light bulb lit for a second, one could perform
10^5 classifications, which is equivalent to around one classification per
second uninterruptedly for almost one day. Notice that this estimate is based
on a classification that lasts 0.5 s and, therefore, does not take into
account the fact that most patterns are correctly classified in a
significantly shorter time (see Fig. 2a, top). If the integration and emission
of spikes is stopped as soon as one of the output units is significantly more
active than the others, then the average energy consumption can be strongly
reduced. The criterion we used to decide when to stop the integration of
spikes (and thus the classification) was based on the spikes emitted by the
readout units. Specifically, we monitored the cumulative activity of each
output unit by counting all the spikes emitted by the corresponding readout
neurons. We stopped the classification when the accumulated activity of the
leading unit exceeded that of the second unit by some threshold. The decision
was the class associated with the leading output unit.
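
The stopping criterion just described can be sketched as follows. The input
format (one list of per-unit spike counts per time step), the use of "greater
than or equal to" for the threshold test, and the function name are
assumptions made for the example; at least two output units are required.

```python
def classify_with_early_stop(spikes_per_tick, threshold):
    """Accumulate readout spikes and stop once the leader is far enough ahead.

    spikes_per_tick -- iterable of lists; entry k of each list is the number
                       of spikes emitted in that time step by the readout
                       neurons of output unit k
    threshold       -- required lead (in spikes) of the first over the second
                       unit; float("inf") disables early stopping
    Returns (decision, elapsed_ticks).
    """
    totals = []
    elapsed = 0
    for counts in spikes_per_tick:
        if not totals:
            totals = [0] * len(counts)
        elapsed += 1
        for unit, c in enumerate(counts):
            totals[unit] += c
        leader, runner_up = sorted(totals, reverse=True)[:2]
        if leader - runner_up >= threshold:
            break
    if not totals:
        raise ValueError("no input ticks were provided")
    decision = max(range(len(totals)), key=totals.__getitem__)
    return decision, elapsed
```

Since the baseline cost dominates the energy budget, the returned
`elapsed_ticks` is a direct proxy for the energy spent on that classification.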

In Fig. 2c we show the performances and the corresponding classification times
for several thresholds. Low thresholds allow for faster yet less accurate
classifications. In both the MNIST and MNIST-back-image datasets, the patterns
that require long classification times are rare. While the performance barely
changes for large enough thresholds, the average classification time can be
substantially reduced by lowering the threshold. For example, for the MNIST
dataset the classification time drops by a factor of 5 (from 500 ms to 100 ms)
and, accordingly, so does the energy consumption (from 1 mJ to 0.2 mJ). Faster
classifications are also possible by increasing either the average firing rate
or the total number of RCNs, both of which entail an increase in energy
consumption, which might be partially or entirely compensated by the decrease
in the classification time. These expedients will speed up the integration of
spike-counts and, as a result, the output class will be determined faster.
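
The energy figures quoted above follow from simple arithmetic, reproduced in
the sketch below. It assumes that the energy per classification is strictly
proportional to the classification time, which is an idealization of the
statement that baseline costs dominate; the constant and function names are
hypothetical.

```python
ENERGY_AT_FULL_TIME_J = 1e-3  # upper bound of 1 mJ per classification
FULL_TIME_S = 0.5             # classification time behind that bound


def energy_per_classification(time_s):
    """Energy in joules, assuming it scales linearly with integration time."""
    return ENERGY_AT_FULL_TIME_J * time_s / FULL_TIME_S


def classifications_per_budget(budget_j, time_s=FULL_TIME_S):
    """How many classifications fit into an energy budget given in joules."""
    return budget_j / energy_per_classification(time_s)
```

A 100 W bulb lit for one second uses 100 J, so
`classifications_per_budget(100.0)` is about 10^5, roughly the number of
seconds in a day (86400). Stopping at 100 ms gives
`energy_per_classification(0.1)`, about 0.2 mJ, in line with the factor of 5
quoted for MNIST.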

In all cases both the energy cost and the classification performance increase
with the total number of emitted spikes or, equivalently, with integration
time, if the average firing rate is fixed. This is a simple form of a more
general energy-speed-accuracy trade-off, a phenomenon that has been described
in several biological information-processing systems (e.g. [20]), and that can
confer great functional flexibility to our classifier. One advantage of basing
the computation on a temporal accumulation of spikes is that the classifier
can be interrupted at any time at the cost of reduced performance, but without
compromising its function. This is in stark contrast to some conventional
clock-based centralized architectures whose mode of computation crucially
relies on the completion of entire monolithic sets of instructions. We can
then envisage utilization scenarios where a spiking-based chip implementation
of our classifier is required to flexibly switch between precise long-latency
classifications (like, e.g., those involving the identification of targets of
interest) and rapid responses of limited accuracy (like the quick avoidance of
imminent danger).

Notice that both the simulated and implemented networks, although entirely
feed-forward, exhibit complex dynamics leading to classification times that
depend on the difficulty associated with the input. This is because neurons
are spiking and the final decision requires some sort of accumulation of
evidence. When a stimulus is ambiguous, the units representing the different
decisions receive similar inputs and the competition becomes harder and
longer. This type of behavior is also observed in human brains.

We will now focus on the comparison of energy consumption and performance
between the neuromorphic classifier and more conventional digital machines.

Energy consumption and performance: comparison with conventional digital
machines. We compared both the classification performance and the energy
consumption of our neuromorphic classifier to those obtained with conventional
digital machines implementing Support Vector Machines (SVMs). SVMs offer a
reasonable comparison because they are among the most successful and
widespread techniques for solving machine-learning problems involving
classification and regression [22, 23], and because they can be efficiently
implemented on digital machines.

To better understand how the energy consumption scales with the complexity of
the classification problem, it is useful to summarize how SVMs work. After
training, SVMs classify an input pattern according to its similarity to a set
of templates, called the support vectors, which are determined by the learning
algorithm to define the boundaries between classes. The similarity is
expressed in terms of the scalar product between the input vectors and the
support vectors. As argued above, we can improve classification performance by
embedding the input vectors in a higher-dimensional space before classifying
them. In this case SVMs evaluate similarities by computing classical scalar
products in the higher-dimensional space. One of the appealing properties of
SVMs is that there is no need to compute explicitly the transformation of
inputs into high-dimensional representations. Indeed, one can skip this step
and compute directly the scalar product between the transformed vectors and
templates, provided that one knows how the distances are distorted by the
transformation. This is known as the "kernel trick" because the similarities
in a high-dimensional space can be computed and optimized over with a kernel
function applied to the inputs. Interestingly, the kernel associated with the
transformation induced by the RCNs of our neural classifier can be computed
explicitly in the limit of a large number of RCNs [18]. This is also the
kernel that we used to compare the performance of SVMs against that of our
neural classifier.

Unfortunately, classifying a test input by computing its similarity to all
support vectors becomes unwieldy and computationally inefficient for large
datasets, as the number of support vectors typically scales linearly with the
size of the training set in many estimation problems. This means that the
number of operations to perform, and hence the energy consumption per
classification, also scales with the size of the training set. This makes SVMs
and kernel methods computationally and energetically expensive in many
large-scale tasks. In contrast, our neural network algorithm evaluates a test
sample by means of the transformation carried out by the RCNs. If the RCN
layer comprises N neurons and the input dimension is N_in, evaluating the
output of a test sample requires O(N_in · N) synaptic events. Thus for large
sample sizes, evaluating a test sample in the network requires far fewer
operations than when using the "kernel trick", because the number is
effectively independent of the size of the training set (cf. [26, 27]).
Systems such as ours may therefore display considerable energy advantages over
SVMs when datasets are large.
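
The scaling argument can be illustrated with a short sketch of the SVM decision
function at test time and of the two operation counts. The plain scalar
product used as the default kernel is a placeholder, not the RCN kernel
employed in the paper, and all names are hypothetical.

```python
def dot(u, v):
    """Plain scalar product, used here as a stand-in kernel."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, coefficients, bias, kernel=dot):
    """Decision value: sum_i c_i * K(sv_i, x) + b, one kernel call per SV."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coefficients)) + bias


def svm_multiply_adds(n_support_vectors, n_in):
    """Test-time cost of the SVM: grows with the number of support vectors."""
    return n_support_vectors * n_in


def rcn_synaptic_events(n_rcn, n_in):
    """Test-time cost of the RCN layer: O(N_in * N), fixed by the network."""
    return n_rcn * n_in
```

The first count grows with the number of support vectors, and hence with the
training set whenever the number of support vectors does, whereas the second
is fixed once the network size N is chosen.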
Figure 3. Energy-accuracy trade-off. a,c Dependence of the classification
accuracy on the number of Randomly Connected Neurons (RCNs) in the neural
classifier and on the number of support vectors (SVs) in the SVC. Panel a
shows this dependence for the MNIST dataset, and panel b for the
MNIST-back-image. As the number of RCNs increases, the classifier becomes more
accurate at the cost of higher energy consumption (b,d). The energy
consumption is based on the average time it takes the neural classifier to
perform the classification (see Fig. 2c). We also show the performance
achieved by three different implementations of support vector classifiers
(legend code: SVC, libsvm; rSVC, reduced primal SVC; SVCperf, cutting plane
subspace pursuit). The algorithms rSVC and SVCperf minimize the number of
support vectors (SVs) with respect to the optimal value and reduce, therefore,
the energy consumption levels at test time. The number of SVs used by the
standard algorithm (libsvm), on the other hand, can go beyond the optimal
value by reducing sufficiently the soft-margin parameter and pushing the
classifier to overfit the data. In all cases, the energy consumption increases
linearly with the number of SVs, as the number of operations per
classification at test time scales linearly with the number of SVs. The
vertical thin lines indicate the abscissa at best performance for the IBM chip
(red) and SVM implementations (black). For reference we indicate the best
performance achieved by the chip with a horizontal dashed line. The horizontal
arrow indicates the reduction in energy consumption that would be attained if
the efficiency of digital machines reached the theoretical lower bound
estimated by [28]. The relation between number of SVs and energy consumption
was determined by simulating the Intel i7 chip running a program that
implements an SVM at test time. b Same as a, but on the MNIST-back-image
dataset. In both cases our neuromorphic classifier exhibits an energy cost per
classification that is orders of magnitude smaller.


In Fig. 3 we compare the energy consumption and performance of the
neuromorphic classifier to those of an SVM implemented on a conventional
digital machine. More specifically, we estimated the energy expenditure of a
digital SVM using a simulator of the Intel i7 processor, which was the machine
with the best energy performance among those that we simulated (see Methods
and Discussion). The energy cost per support vector per pattern was estimated
to be around 5.2 μJ, a quantity that is not far above what is considered as a
lower bound on energy consumption for digital machines [28]. For both the
neuromorphic classifier and the digital SVM we progressively increased the
performance of the classifiers by increasing the number of RCNs (in the case
of the neuromorphic classifier), and by varying the number of support vectors
(in the case of the SVM), see Figures 3a, c. For the SVM we tried three
different algorithms to minimize the number of support vectors and hence the
energy consumption (for more details, see caption of Fig. 3 and Methods). For
the IBM chip we estimated the energy consumption both in the case in which we
stopped the classifications with the criterion described in the previous
section and in the case in which the classification time was fixed at 500 ms
(see Suppl. Info.). In both cases the energy consumption is significantly
lower for the neuromorphic classifier, being in the former case approximately
2 orders of magnitude smaller for both the MNIST and the MNIST-back-image
datasets, while still achieving comparable maximal performances (Fig. 3a-d).

Scalability The MNIST dataset has only 10 output classes. We wondered whether the advantage
of the neuromorphic classifier in terms of energy consumption is preserved when the number of
classes increases and the classification task becomes more complex. To study how the energy
consumption scales with the number of classes we used the LaTeX dataset, which contains 293
classes of distorted characters. We progressively increased the number of classes to be learned
and classified, and we studied the performance and the energy consumption of both the digital
implementation of the SVM and the neuromorphic classifier. Specifically, given a number of
classes that was varied between 2 and 293, we selected a random subset of all the available
classes, and we trained both the SVM and the neural classifier on the same subset. The results
are averaged over 10 repetitions, each one with a different sample of output classes.

To make a meaningful comparison between the energy consumed by an SVM and by the
neuromorphic classifier, we equalized all the classification accuracies, as follows. For each
classification problem we varied the margin penalty parameter of the standard SVC using grid
search and picked the best performance achieved. We then varied the relevant parameters of
the other two classifiers so that their classification accuracy matched or exceeded the accuracy
of the standard SVC. Specifically, we progressively increased the number of basis functions (in
the primalSVC method) and the number of RCNs (in the neural classifier) until both reached
the target performance. For each classification problem we averaged over 10 realizations of the
random projections of the neural classifier.

The results are summarized in Fig. 4. The energy consumption is about two orders of
magnitude larger for the SVM throughout the entire range of numbers of classes that we
considered, although for a very small number of classes (2-3) the advantage of the
neuromorphic classifier is strongly reduced, most likely because the algorithms that minimize
the number of SVs work best when the number of classes is low. This plot indicates that the
energy advantage of the neuromorphic classifier over SVMs implemented on conventional
digital machines is maintained for more complex tasks involving a larger number of classes.

It is interesting to discuss the expected scaling for a growing number of classes. Consider a
generic multi-class problem with $C$ classes, solved through reduction to multiple combined
binary SVMs. In a one-vs-all reduction scheme, each binary classifier is trained to respond to
exactly one of the $C$ classes, and hence $C$ SVMs are required. For each SVM, one needs to
compute the scalar products between the test sample to be classified and the $N_{sv}$ support
vectors. Each scalar product requires $N_{in}$ multiplications and sums. In the favorable case
in which all binary classifiers happen to share the same support vectors, the scalar products
can be computed only once and would require $N_{in} N_{sv}$ operations. These $N_{sv}$
scalar products then need to be multiplied by the corresponding coefficients, which are
different for the different SVMs. This requires $C N_{sv}$ additional operations. If $N_{sv}$
scales linearly with $C$, as in the cases we analyzed, then the total energy $E$ will scale as

$E \sim N_{in} C + C^2.$

When $C$ is small compared to $N_{in}$, the first term dominates, and the expected scaling is
linear. However, for $C > N_{in}$ the scaling is expected to be at least quadratic. It can grow
more rapidly if the support vectors are different for different classifiers.
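
The operation count behind this scaling argument can be sketched in a few lines. This is only an illustration under the shared-support-vector assumption made above; the function names and the `sv_per_class` proportionality constant are hypothetical and do not come from the original analysis.

```python
def svm_ops_per_classification(n_in: int, n_sv: int, n_classes: int) -> int:
    """Operation count for one test pattern when all binary SVMs share their SVs."""
    kernel_ops = n_in * n_sv        # scalar products with every support vector
    readout_ops = n_classes * n_sv  # per-class weighting of those scalar products
    return kernel_ops + readout_ops


def ops_with_linear_sv_growth(n_in: int, n_classes: int, sv_per_class: int = 1) -> int:
    """Same count when N_sv grows linearly with C (sv_per_class is an assumed constant)."""
    n_sv = sv_per_class * n_classes
    return svm_ops_per_classification(n_in, n_sv, n_classes)
```

With `n_in = 256`, the first term dominates for a handful of classes, while for `n_classes` well above 256 the quadratic term takes over.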

Interestingly, the expected scaling for the neural network classifiers that we considered is the
same. The energy consumption depends mostly on the number of cores needed. This number
will be proportional to the number of RCNs, $N$, multiplied by the number of classes. Indeed,
each core can receive up to 256 inputs, so the total number of cores needed will be
proportional to $\lceil N/256 \rceil$, with $\lceil \cdot \rceil$ denoting the ceiling function.
Moreover, the number of readout units, which are the output lines of these cores, will be
proportional to the number of classes. Hence the $NC$ dependence. In the cases we analyzed
$N$ depends linearly on the number of classes, and hence the energy depends quadratically on
$C$, as in the case of the SVMs when $C$ is large enough. Notice that there is a second term
contributing to the energy that also scales quadratically with $C$. It comes from the need to
replicate the RCNs $C$ times, due to the limited fan-out of the RCNs. Again, under the
assumption $N \sim C$, this term will also scale quadratically with $C$.

Given that the scaling with the number of classes is basically the same for the neuromorphic
classifier as for the SVMs, it is not unreasonable to hypothesize that the energy consumption
advantage of the neuromorphic implementation would be preserved for a much larger number
of classes.


Discussion 

Our results indicate that neuromorphic devices are mature enough to achieve performances on
a real-world machine-learning task that are comparable to those of state-of-the-art
conventional devices with von Neumann architecture, while using only a tiny fraction of their
energy. Our conclusions are based on a few significant tests, with a comparison limited to our
neuromorphic classifier and a few digital implementations of SVMs. This clearly restricts the
generality of our results and does not preclude situations in which the advantage of the
neuromorphic approach might be less prominent. In any case, the merit of our study is to offer
a solid comparison with implementations on current conventional digital platforms that are
energy-efficient themselves. In particular, the algorithm we used on conventional digital
machines involves only multiplications between matrices and vectors, the efficiency of which
has been dramatically increased in the last decades thanks to optimized parallelization.
Furthermore, not only did we try to match the classification performance of the competitors,
but we also considered two additional SVM algorithms that minimize the number of support
vectors, and hence the final number of operations. Other choices of SVM algorithm would
certainly lead to different estimates for energy consumption, but it is rather unlikely that they
would change them by 2 orders of magnitude. It is possible that full custom unconventional
digital machines based, e.g., on field programmable gate arrays (FPGAs) would be more
energy-efficient, but it is hard to imagine that they would break the predicted energy wall
discussed in [^. If this assumption is right, neuromorphic hardware would always be more
efficient when performing the type of tasks that we considered. Moreover, analog
neuromorphic VLSI or unreliable digital technologies might allow for a further reduction of
energy consumption, probably by another order of magnitude [5, 29, 30]. The current energy
consumption levels achieved by analog systems are very close to those of biological brains in
terms of energy per spike, although many of these systems are relatively small and it is unclear
whether they can ever be extended to brain-scale architectures.

Other custom chips that can solve real-world tasks have been designed. An example is the
FPGA chip NeuFlow, designed to implement convolutional networks for visual recognition.
The chip is digital and delivers $4.9 \times 10^{11}$ operations per second per watt or,
equivalently, uses about 2 pJ per operation.

It is also interesting to discuss the performance of other conventional digital processors in the
benchmarks we examined. Let us consider for example the implementations of SVMs
classifying the MNIST digits with about $10^4$ support vectors, which is roughly the number
of vectors we need to achieve the best classification accuracy. As we have shown, the Intel i7
takes about 10 ms to perform a classification, at an approximate cost of 50 mJ. The IBM chip,
in contrast, required 1 mJ for the longest classification times (500 ms), and 0.2 mJ for the
average classification time (100 ms). We also quantified the energy cost of the ARMv7, which
is a more energy-efficient yet slower microprocessor often used in mobile technologies. Its
energy consumption per classification was substantially higher, around 700 mJ. The main
reason for this high consumption is that it takes more than 0.6 seconds to perform a single
classification, and the baseline consumption, which increases linearly with the classification
time, is a large portion of the total energy needed for a classification. Finally, we considered
the recent Xeon Phi, which has a massively parallel architecture and is employed in high
performance computing applications. As we do not have a simulator for the Phi, we could only
indirectly estimate a lower bound for the energy consumption (see Methods for more details).
According to our estimate, a single classification requires only 0.2 μs and uses about 16 mJ,
which would be significantly lower than the energy cost of the i7 and very close to the
estimated lower limit of energy consumption [28], but still larger than the consumption of the
IBM chip. Notice, however, that both the classification time and the energy consumption of
the Xeon Phi processor are very likely to be grossly underestimated, as they are simply derived
from the peak performance of 100 Tflop/s. The estimates for the i7 and the ARMv7 are
significantly more reliable, because we derived them by simulating the processors.

To summarize, our results compellingly suggest that the neuromorphic approach is finally
competitive in terms of energy consumption in useful real-world machine learning tasks and
constitutes a promising direction for future scalable technologies. The recent success of deep
networks for large-scale machine learning [31, 32] makes neuromorphic approaches
particularly relevant and valuable. This will certainly be true for neuromorphic systems with
synaptic plasticity, which will enable these devices to learn autonomously from experience.
Learning is now available only in small neuromorphic systems [33, 34, 35], but hopefully new
VLSI technologies will allow us to implement it in large-scale neural systems as well.

Figure 4. Dependence of the energy consumption on the number of classes. a Classification
accuracy for the neural classifier and for two SVM algorithms, as a function of the number $C$
of classes for the LaTeX dataset. The parameters of the different classifiers are tuned to have
approximately the same classification accuracy. b Energy consumption as a function of the
number of classes, for the LaTeX dataset. Given a number $C$ of classes, every point in the
plot is obtained by training a given classifier on $C$ randomly sampled classes among the 293
available ones. This procedure is repeated 10 times for every value of $C$ and every type of
classifier. Each datapoint associated with the neural classifier ('RCN') was in turn estimated
from a sample of 10 realizations of the random connections (squares indicate sample means,
error bars indicate the 0.1 and 0.9 fractiles of the sample). c As in b, but showing the number of
support vectors and RCNs as a function of the number of classes.

Materials and Methods

Image sets for classification benchmarks We used three datasets in our study: MNIST,
MNIST-back-image, and LaTeX. The MNIST dataset consists of images of handwritten digits
(10 classes) [36]. The MNIST-back-image dataset contains the same digits as MNIST, but in
this case the background of each pattern is a random patch extracted from a set of 20 black
and white images downloaded from the Internet [37]. Patches with low pixel variance (i.e.,
containing little texture) are discarded. The LaTeX dataset consists of distorted versions of 293
characters used in the LaTeX document preparation system [38, 39]. All datasets consist of
$l \times l$ pixel gray-scale images, and each image is associated with one out of $C$ possible
classes. The size of the images, the number of classes, and the sizes of the training and test sets
depend on the dataset (see table below).

Dataset             Image size   Classes   Training set   Test set
MNIST               28 x 28      10        60000          10000
MNIST-back-image    28 x 28      10        12000          50000
LaTeX               32 x 32      293       14650          9376

Preprocessing Every sample image was reshaped as an $l^2$-dimensional vector, and the
average gray level of each component was subtracted from the data. The dimensionality of the
resulting image vector was then reduced to 256 using PCA. To guarantee that all the selected
components contributed uniformly to the patterns, we applied a random rotation to the
principal subspace (see, e.g., [40]). We denote by $N_{in} = 256$ the dimension of that
subspace.


The architecture of the network and the training algorithm 

We map the preprocessed $N_{in}$-dimensional image vector, $\mathbf{s}$, into a higher
dimensional space through the transformation

$x_i = f(\mathbf{w}_i \cdot \mathbf{s}), \quad i = 1, \ldots, N,$

where $\mathbf{w}_i$ is an $N_{in}$-dimensional sparse random vector and $f(\cdot)$ is a
nonlinear function. This is the transformation induced by a neural network with $N_{in}$
input units and $N$ output units with activation function $f(\cdot)$. More succinctly,

$\mathbf{x} = f(W^T \mathbf{s}),$ [1]

where $W$ is a weight matrix of dimensions $N_{in} \times N$ formed by adjoining all the
column weight vectors $\mathbf{w}_i$, and where $f(\cdot)$ acts componentwise, i.e.,
$f(\mathbf{x}) = (f(x_1), \ldots, f(x_N))^T$.

The output of the random nonlinear transformation, $\mathbf{x}$, is used as the input to a
linear $C$-class discriminant, consisting of $C$ linear functions of the type
$y_j = \sum_{k=1}^{N} J_{jk} x_k$, with $j = 1, \ldots, C$. More compactly,

$\mathbf{y} = J\mathbf{x},$ [2]

where $\mathbf{y} = (y_1, \ldots, y_C)^T$, $J$ is a $C \times N$ matrix, and $\mathbf{x}$ is
given by Eq. 1. A pattern $\mathbf{x}$ is assigned to class $C_j$ if
$y_j(\mathbf{x}) > y_k(\mathbf{x})$ for all $k \neq j$. The elements of $J$ are learned offline
by imposing a 1-of-$C$ coding scheme on the output: if the target class is $j$ then the target
output $\mathbf{t}$ is a vector of length $C$ where all components are zero except component
$t_j$, which is 1. For the offline training of weights we use the pseudoinverse, which
minimizes the mean squared error of the outputs. This technique has been shown to be a good
replacement for empirical minimization problems when the dataset is embedded in a random
high-dimensional space, which is our case imiiniiiiiiz].
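
The two stages described above (random nonlinear expansion followed by a pseudoinverse readout) can be illustrated with a short NumPy sketch. It is not the code used in the study: the function names are hypothetical, patterns are stored as rows, and the nonlinearity is taken to be threshold-linear, as stated later for the hidden units.

```python
import numpy as np


def random_expansion(S, W):
    """Apply x = f(W^T s) to each row s of S, with f threshold-linear."""
    return np.maximum(S @ W, 0.0)


def train_readout(X, labels, n_classes):
    """Least-squares readout J (C x N) for 1-of-C targets, via the pseudoinverse."""
    T = np.eye(n_classes)[labels]     # (P, C) matrix of one-hot target vectors
    return (np.linalg.pinv(X) @ T).T  # minimizes the squared error of X J^T against T


def classify(X, J):
    """Assign each pattern to the class with the largest output y_j."""
    return np.argmax(X @ J.T, axis=1)
```

When the number of RCNs exceeds the number of training patterns, the pseudoinverse solution reproduces the training targets exactly, so accuracy on held-out data is the meaningful measure.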

Neuromorphic chip implementation 

The chip is composed of multiple identical cores, each of which consists of a neuromorphic
circuit that comprises $n = 256$ axons, $n$ neurons, and $n \times n$ adjustable synapses
(iiadolEj , see also Fig. [^). Each axon provides the inputs by feeding the spiking activity of one
given neuron, which may or may not reside in the core. The incoming spiking activity to all
$n$ axons in a core is represented by a vector of activity bits $(A_1(t), \ldots, A_n(t))$ whose
elements indicate whether or not the neurons associated with the incoming axons emitted a
spike in the previous time step. The intersection of the $n$ axons with the $n$ neurons forms
a matrix of programmable synapses. The weight of active synapses is determined by the type
of axon and the type of neuron the synapse lies on. Specifically, each core can contain up to
four different types of axon, labeled $G_j \in \{1, 2, 3, 4\}$, whereas it can accommodate an
unlimited number of neuron types, each of which has four associated synaptic weights
$S_i = (S_i^1, \ldots, S_i^4)$. The strength of an active synapse connecting axon $j$ with
neuron $i$ is $S_i^{G_j}$, that is, the axon type determines which weight to pick among the
weights associated with neuron $i$. The net input received by neuron $i$ at time step $t$ is
therefore $h_i(t) = \sum_{j=1}^{n} A_j(t) B_{ij} S_i^{G_j}$, where $B_{ij}$ is 1 or 0
depending on whether the synapse between axon $j$ and dendrite $i$ is active or inactive.

At each time step the membrane potential $V_i(t)$ of neuron $i$ receiving input $h_i(t)$ is
updated according to $V_i(t+1) = V_i(t) - \beta + h_i(t)$, where $\beta$ is a constant leak. If
$V_i(t)$ becomes negative after an update, it is clipped to 0. Conversely, when $V_i(t)$
reaches the threshold $V_{thr}$, the potential is reset to $V_{reset}$ and the neuron emits a
spike, which is sent through the neuron's axon to the target core and neuron. This design
implies that each neuron can connect to at most $n$ neurons, which are necessarily in the
same core. The voltage of each neuron was initialized by drawing randomly and with equal
probability from a set of 4 evenly spaced values from $V_{reset}$ to $V_{thr}$.
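
The update rule can be made concrete with a minimal single-neuron sketch. It assumes a constant input current and integer parameters; the function names are hypothetical, and the sketch is not a model of the actual chip simulator.

```python
def step_neuron(v, h, leak, v_thr, v_reset):
    """One membrane update; returns the new potential and whether a spike was emitted."""
    v = v - leak + h
    if v < 0:
        v = 0  # negative potentials are clipped to zero
    if v >= v_thr:
        return v_reset, True  # threshold reached: reset and emit a spike
    return v, False


def count_spikes(h, leak, v_thr, v_reset, n_steps, v0=0):
    """Number of spikes emitted over n_steps under a constant input h."""
    v, n_spikes = v0, 0
    for _ in range(n_steps):
        v, spiked = step_neuron(v, h, leak, v_thr, v_reset)
        n_spikes += int(spiked)
    return n_spikes
```

For a constant input above the leak, the spike count per time step approaches (h - leak) / (v_thr - v_reset), and inputs at or below the leak produce no spikes.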

Signal-to-rate transduction The input to the neuromorphic chip consists of a set of spike trains
fed to the neurons of the input layer. To transform the vector signal $\mathbf{s}$ into spike
trains, we first shifted the signal by a constant offset $\bar{s}$, set to a fixed multiple of
$\sigma$, where $\sigma$ is the standard deviation across all signal components of all
patterns. The shifted signal was then scaled by a factor $\nu_{sc}$ chosen to ensure moderate
output rates in the RCN layer, and the result was linear-rectified to positive values. In short,
the input rate $\nu_i$ associated with signal $s_i$ is $\nu_i = \nu_{sc} [s_i + \bar{s}]_+$,
$i = 1, \ldots, N_{in}$, where $[x]_+$ is $x$ if $x > 0$, or 0 otherwise. The values $\nu_i$
were then used to generate regular spike trains with fixed inter-spike interval $1/\nu_i$.

Basic architecture The circuit is divided into two functional groups, or layers, each of which
comprises several cores. The first functional group is the RCN layer, which computes the
random nonlinear expansion in Eq. 1. The second functional group computes the $C$-class
discriminant $\mathbf{y} = J\mathbf{x}$. The output of the classifier is just
$\arg\max_j y_j$, where $j$ runs over the $C$ possible categories. The argmax operation was
not computed by the chip, but was determined off-line by comparing the accumulated spike
counts across all outputs. In the following, we describe the implementation of the two layers
in more detail.

RCN layer We first set the dimensionality of the input to the number of available axons per
core, i.e., $N_{in} = n = 256$. A convenient choice for $W$ is an $n \times N$ matrix where
each column is a vector of zeros except for exactly $m < n$ nonzero entries, which are
randomly placed and take a fixed integer value $w$. We took $m = 26$, which corresponds to
a connectivity level of around 0.1. Lowering the connectivity has the advantage of decreasing
energy costs by reducing the number of total spikes and active synapses, without impacting
the classification performance. The random expansion was mapped onto the chip by splitting
the matrix into $\lceil N/n \rceil$ submatrices of size $n \times n$, and using each submatrix
as the (boolean) connectivity matrix $B_{ij}$ of a core.
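
A sparse connectivity matrix of this kind, and its split into core-sized blocks, can be sketched as follows. The function names are hypothetical and the sketch only illustrates the construction described above; the last block is narrower than $n$ columns whenever $N$ is not a multiple of $n$.

```python
import numpy as np


def sparse_random_connectivity(n, n_rcn, m, rng):
    """Boolean n x n_rcn matrix with exactly m randomly placed ones in each column."""
    B = np.zeros((n, n_rcn), dtype=bool)
    for col in range(n_rcn):
        B[rng.choice(n, size=m, replace=False), col] = True
    return B


def split_into_cores(B, n):
    """Split the columns of B into ceil(N / n) blocks of at most n neurons each."""
    return [B[:, k:k + n] for k in range(0, B.shape[1], n)]
```

Multiplying the boolean matrix by the fixed integer value $w$ recovers the weight matrix $W$ of the expansion.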

With this arrangement, each of the $N$ neurons distributed among the $\lceil N/n \rceil$ cores
receives a sparse and random linear combination of signals. Specifically, the average current
received by each RCN is

$\langle h_i \rangle = \sum_{j=1}^{n} W_{ji} \nu_j.$

A zeroth-order approximation of the firing rate of a general VLSI neuron receiving a current
$h_i$ is

$\nu_i = [h_i - \beta]_+ / (V_{thr} - V_{reset}),$

where $V_{thr}$ is the threshold for spike emission and $V_{reset}$ is the reset potential m-.

We chose the parameters $w$ and $\beta$ to meet two criteria. First, we required the fraction
of RCNs showing any firing activity (i.e., the coding level $f$) to be around 0.25. This coding
level is a good compromise between the need for discrimination and generalization, and it
keeps finite-size effects at bay [^. Second, we required the distribution of activities across
active RCNs to be sufficiently wide. Otherwise the information carried by the spiking activity
of the RCNs is too imprecise to discriminate among patterns.

All the cores in the RCN layer receive exactly the same 
n-dimensional input signal. 

Readout 

The readout matrix J was trained offline and mapped to the 
chip architecture as follows. 

Weight quantization Because the chip can hold only integer-valued synapses, we need to map
the set of all components of $J$ into an appropriate finite set of integers. We started by
clipping the synaptic weights within the bounds $(-4\sigma, 4\sigma)$, where $\sigma$ is the
standard deviation of the sample composed of all the components of $J$. We then rescaled the
weights to a convenient magnitude $J_{max} = 28$ (see below), and rounded the weight values
to the nearest integer.
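
The quantization steps can be sketched as below. The text does not spell out the rescaling, so this sketch assumes that the clipping bound $4\sigma$ is mapped onto $J_{max}$; that mapping and the function name are assumptions made for illustration.

```python
import numpy as np


def quantize_weights(J, j_max=28, n_sigma=4.0):
    """Clip J to +/- n_sigma standard deviations, rescale to j_max, round to integers."""
    sigma = J.std()
    bound = n_sigma * sigma
    clipped = np.clip(J, -bound, bound)
    scaled = clipped * (j_max / bound)  # assumed rescaling: the bound maps to j_max
    return np.rint(scaled).astype(int)
```

Under this assumption every quantized weight lies between -28 and 28, the range that the synaptic contacts described next can represent.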

Weight assignment The TrueNorth connectivity constraints dictate that each RCN can project
to only one axon, meaning that there are at most $n = 256$ synaptic contacts available to
encode the $C = 10$ weights, $J_{0i}, \ldots, J_{9i}$, associated with the $i$-th RCN. We
allocated 24 contacts per class and per axon (see Fig. 5). These 24 contacts were divided into
four groups comprising 6 weights each, with values 1, 2, 4, −1, −2, −4. This allowed us to
represent any integer weight from −28 to 28 (each of the 4 groups encodes a maximum weight
of 7, sign aside). To distribute any weight value $w$ across the available synaptic contacts, we
decomposed $w$ into a sum of four terms, given by the integer division of $w$ by 4 with the
remainder spread evenly across terms (e.g., 19 = 4 + 5 + 5 + 5). Each of these values was
assigned to one group, represented in base 2, and mapped to a pattern of active-inactive
synapses according to the weight associated with each axon-dendrite intersection. Positive and
negative weights, as well as strong and weak weights, were balanced along a dendrite by
changing the sign and order of the weights in the crossbar (see alternating colors and
saturations in Fig. 5).
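
The decomposition of an integer weight into four group values and their binary patterns can be sketched as follows. The function names are hypothetical, and the order in which the groups absorb the remainder is arbitrary here; the text does not fix it.

```python
def split_weight(w, n_groups=4):
    """Split integer w into n_groups same-sign terms that differ by at most one."""
    sign = -1 if w < 0 else 1
    base, rem = divmod(abs(w), n_groups)
    # the first `rem` groups absorb the remainder (an arbitrary ordering choice)
    return [sign * (base + (1 if g < rem else 0)) for g in range(n_groups)]


def to_bits(term, n_bits=3):
    """Active contacts (1) for |term| on strengths 1, 2, 4, least significant first."""
    if abs(term) >= 2 ** n_bits:
        raise ValueError("term does not fit in the available contacts")
    return [(abs(term) >> k) & 1 for k in range(n_bits)]
```

For example, `split_weight(19)` yields the terms 5, 5, 5, 4, the same multiset as the decomposition quoted above, and `to_bits(5)` marks the contacts of strength 1 and 4 as active.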

Negative threshold For the readout to work properly, the firing activity of readout neurons
must be proportional to the linear sum of the inputs from the RCNs. This requires neurons to
operate in the linear regime of their dynamic range, a regime that can be enforced by lowering
the threshold $\beta_{out}$ of readout neurons. We set $\beta_{out} < 0$, which is equivalent
to adding a constant positive current to each neuron. If the current-to-rate transduction
function were the threshold-linear function of Eq. |®, the baseline activity induced by this
constant current would be $|\beta_{out}|/(V_{thr} - V_{reset})$ per readout neuron. The
contribution of this background signal should be subtracted from the readout outputs if one
wants to get the equivalent of Eq. 2, although this step is unnecessary if one only wishes to
compare output magnitudes (as we implicitly do in order to find the maximal output).

Figure 5. Implementation of the readout matrix in a core. The diagram represents the first 8
input lines and first 48 dendrites (two output units) of a typical readout core. Under each
axon-dendrite contact is a square that indicates the potential synaptic strength at the site:
color indicates whether the connection is excitatory (red) or inhibitory (blue), while the
saturation level represents the absolute value of the synaptic strength, which can be 1, 2, or 4
(low, medium, and high saturation, respectively). Only the sites marked with a dot are active.
The green frame highlights all the synaptic contacts allocated for an arbitrary weight of the
readout matrix, in this case $J_{13} = -9$, which is decomposed as the 4-term sum
$-9 = -2 - 3 - 2 - 2 = -010_2 - 011_2 - 010_2 - 010_2$. Note that in this particular axon
the ordering of weights is $2^0, 2^1, 2^2$ (rightmost bit is the most significant).

Support Vector Machines. We trained SVMs to perform multiclass classifications based on a
one-vs-all scheme, so that the number of output units coincides with the number of classes (as
in the neural classifier). SVMs were evaluated using arc-cosine kernels, which mimic the
computation of large feedforward networks with one or more layers of hidden nonlinear units
[18]. For our particular architecture, based on one hidden layer built with threshold-linear
units, the kernel is $k(\mathbf{x}, \mathbf{y}) = \|\mathbf{x}\| \|\mathbf{y}\| J_1(\theta)$,
where $J_1(\theta) = \sin\theta + (\pi - \theta)\cos\theta$ and $\theta$ is the angle between
the inputs $\mathbf{x}$ and $\mathbf{y}$.
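
The kernel expression can be evaluated directly, as in this minimal NumPy sketch. It follows the formula exactly as written above, without any additional normalization constant; the function name is hypothetical, and this is not the patched libsvm code.

```python
import numpy as np


def arccos_kernel(x, y):
    """Arc-cosine kernel k(x, y) = |x| |y| (sin(theta) + (pi - theta) cos(theta))."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0.0 or ny == 0.0:
        return 0.0  # the angle is undefined for a zero vector; the norm factor is zero
    cos_theta = np.clip(np.dot(x, y) / (nx * ny), -1.0, 1.0)  # guard against rounding
    theta = np.arccos(cos_theta)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_theta)
```

Parallel unit vectors give the maximal value of pi, orthogonal unit vectors give 1, and antiparallel vectors give 0.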

We considered three types of SVM. For the standard SVM we used the open library libsvm
[44], which we patched to include the arccos kernel. The other two SVMs reduce the number
of support vectors without sacrificing performance substantially. One of these algorithms is
primalSVC, which greedily selects the basis functions by optimizing the primal objective
function [45]. The other method is based on the so-called Cutting-Plane Subspace Pursuit
algorithm, which reduces the number of support vectors by using basis functions that, unlike
in standard SVMs, are not necessarily training vectors [46]. This method is implemented in the
library SVMperf. Unlike the other two classifiers, SVMperf used RBF kernels instead of arccos
kernels.

Estimation of the IBM chip energy consumption The energy consumption of the IBM chip was
estimated from the TrueNorth specifications [5]. The total energy consumption comprises the
baseline energy (15.9 μW per core), the energy to emit spikes (109 pJ per spike), the energy
needed to read active synapses (10.7 pJ per active synapse), and the energy necessary to
update membrane potentials (1.2 pJ per neuron). We ignored the input-output energy needed
to transmit spikes off chip and receive spikes on chip. These numbers provide a reasonable
estimate of the energy consumption of systems with a conservative supply voltage of 0.775 V;
most chips operate near or below this estimate. For a setup with $2^{14}$ RCNs, 26 dendrites
per class, and 10 classes, the power was about 2.08 mW, 95% of which corresponds to the
baseline power.


Figure 6. Simulation of a digital support vector machine. a Number of operations (black
circles, left ordinate) and runtime (blue dots, right ordinate) required by a digital SVM to
classify 10 test patterns from the MNIST dataset, as a function of the number of support
vectors. The SVM performance was estimated with a simulator of the Intel i7 processor.
b Energy consumption associated with the datapoints shown in a (squares). The straight line is
a least-squares fit.

Figure 7. Performance versus energy consumption at a fixed classification time. Panels are as
in Figs. [^,d, but the classification time is now fixed at 500 ms, rather than determined by a
stopping criterion.

Scaling of the energy with the number of classes The estimation was based on the energy cost
of the simulated classifications of the MNIST dataset, and extrapolated to the designs required
by an increasing number of classes. As the number of classes $C$ increases, so does the
number of readout neurons necessary to perform a classification and, therefore, so does the
required number of readout cores. Specifically, if we assign $s_c$ synaptic contacts per axon
and per class, we will need a total of $s_c C$ output lines. These output lines need to be
connected to all the $N$ neurons through the input lines of the readout cores. Because each
readout core can accommodate 256 output lines, connected to 256 input lines, the total
number of readout cores will be $\lceil N/256 \rceil \lceil s_c C/256 \rceil$
($\lceil \cdot \rceil$ indicates the ceiling function). In principle the number of RCN cores
would be simply $\lceil N/256 \rceil$. However, each RCN should project to
$\lceil s_c C/256 \rceil$ cores, which implies that each RCN core must be cloned
$\lceil s_c C/256 \rceil$ times due to the fan-out constraint that each RCN can project to only
one core. The total number of cores is therefore
$N_{cores} = 2 \lceil N/256 \rceil \lceil s_c C/256 \rceil$, where the factor 2 accounts for the
contributions of both the readout and the RCN cores.

The total number of spikes emitted was estimated from the reference value we got from the
chip simulation (for 10 classes, $N = 2^{14}$, $s_c = 24$, and 500 ms of classification time),
scaled appropriately for the new $N_{cores}$. More concretely, if we denote by $n_{sp}^0$ the
number of spikes emitted during our reference simulation, the number of emitted spikes in a
general case is $n_{sp} = n_{sp}^0 \lceil s_c C/256 \rceil (T/500)(N/2^{14})$, where $T$ is the
duration of the simulation in milliseconds. We chose this duration to be $T = 108$ ms, which
is the average classification time of the chip implementation on the MNIST dataset when the
spike difference is 80 spikes, and which yields only 0.1% less in performance than in the
fixed-duration case (97.2% vs 97.3%). With $T$ and the estimated values of $N_{cores}$ and
$n_{sp}$, it is straightforward to compute the energy consumption according to the values
given in the previous paragraph.
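
The core count and the spike-count extrapolation can be sketched as follows. The sketch assumes that the per-core baseline figure quoted earlier is in microwatts and covers only the baseline term of the energy; the function names and default arguments are hypothetical.

```python
import math

BASELINE_W_PER_CORE = 15.9e-6  # baseline power per core, assumed to be in microwatts


def n_cores(n_rcn, n_classes, s_c=24, core_size=256):
    """Total cores: 2 * ceil(N / 256) * ceil(s_c * C / 256)."""
    return 2 * math.ceil(n_rcn / core_size) * math.ceil(s_c * n_classes / core_size)


def scaled_spike_count(n_sp_ref, n_rcn, n_classes, t_ms, s_c=24):
    """Rescale the reference spike count (10 classes, N = 2**14, 500 ms) to a new design."""
    return n_sp_ref * math.ceil(s_c * n_classes / 256) * (t_ms / 500) * (n_rcn / 2 ** 14)


def baseline_energy_j(n_rcn, n_classes, t_ms, s_c=24):
    """Baseline energy in joules consumed by all cores over t_ms milliseconds."""
    return n_cores(n_rcn, n_classes, s_c) * BASELINE_W_PER_CORE * t_ms * 1e-3
```

For the reference design (`n_rcn = 2**14`, 10 classes) this gives 128 cores and a baseline energy of about 1 mJ over 500 ms, in line with the per-classification figure quoted in the Discussion.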

Energy consumption in von Neumann digital machines 

Configuration The runtime and power of microprocessors with von Neumann architectures
were estimated with the recently developed simulators GEM5 (gem5.opt 2.0) [47] and McPAT
(ver. 1.2) [3H]. For the estimation we used an architecture configuration similar to that of the
recent Intel Core i7 processors [15], which incorporate state-of-the-art CMOS technology.
Specifically, we used an x86_64, O3, single core architecture at 2.66 GHz clock frequency,
with 32 KB 8-way L1-i and 32 KB 8-way L1-d caches, 256 KB 8-way L2 cache, 64 B cache line
size, and 8 GB DDR3 1600 DRAM. Channel length was 22 nm, HP type, using long channel if
appropriate. VDD was 0.9 V, slightly higher than the 0.775 V used for the IBM chip. If we
could use the same voltage in the Intel i7 simulator, the energy consumption would be lower
by a factor $(0.775/0.9)^2 = 0.74$. This 26% reduction would not change the main
conclusions about the energy consumption gap between the IBM chip and the conventional
von Neumann digital machines, which is 2-3 orders of magnitude.

Simulations The benchmark was the test phase of the SVMs, already trained. Simulations
showed that a modern microprocessor based on a von Neumann architecture takes 115.5 ms
to evaluate the test set with 8087 SVs, while consuming 424.6 mJ (DRAM energy consumption
not included). When we varied the number of support vectors from 9 to 8087, both the
runtime and energy consumption grew proportionally to the number of SVs, while the power
was roughly constant due to the fixed hardware configuration (see Fig. 6). To estimate how
the energy used by von Neumann digital SVMs scales with the number of classes, we ran
another set of simulations with the Intel i7 simulator, this time varying both the number of
support vectors and the number of classes in the classification problem. This step was
necessary to determine the overhead incurred when we increase the number of output units.
For a given number of classes, the energy cost per support vector was estimated from the
least-squares fit of the energies against the number of support vectors.

Mobile processor We also investigated the runtime and energy consumption of a more
energy-efficient but slower mobile microprocessor performing the same target workload. The
architecture configuration was: ARMv7, O3, single core, 1 GHz CPU clock frequency, 32 kB
4-way L1-i and 32 kB 4-way L1-d caches, and 128 kB 8-way L2 cache, which is similar to the
architecture of ARM Cortex-A9 [50]. The technology node (22 nm) and simulators were the
same as in the experiment with the microprocessor mimicking Intel Core i7. For the
benchmark code with the largest number of SVs, the task re¬ 
quired 1.2-10^° operations that took 6.35 s at a cost of 7.34 J. 
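
To put the mobile figures next to the desktop ones, the short sketch below derives average power, slowdown, and energy ratio from the runtimes and energies quoted in the text. It assumes that both sets of figures refer to the same benchmark (the largest number of SVs) and that DRAM energy is excluded in both cases; the variable names are our own.

```python
# Figures quoted in the text for the benchmark with the largest number of SVs.
DESKTOP_RUNTIME_S = 115.5e-3   # Intel Core i7-like configuration
DESKTOP_ENERGY_J = 424.6e-3
MOBILE_RUNTIME_S = 6.35        # ARM Cortex-A9-like configuration
MOBILE_ENERGY_J = 7.34


def average_power_w(energy_j: float, runtime_s: float) -> float:
    """Average power in watts over the whole run."""
    return energy_j / runtime_s


desktop_power_w = average_power_w(DESKTOP_ENERGY_J, DESKTOP_RUNTIME_S)
mobile_power_w = average_power_w(MOBILE_ENERGY_J, MOBILE_RUNTIME_S)

# How much slower the mobile run is, and how much more energy it uses.
slowdown = MOBILE_RUNTIME_S / DESKTOP_RUNTIME_S
energy_ratio = MOBILE_ENERGY_J / DESKTOP_ENERGY_J

print(f"desktop power = {desktop_power_w:.2f} W, mobile power = {mobile_power_w:.2f} W")
print(f"slowdown = {slowdown:.1f}x, energy ratio = {energy_ratio:.1f}x")
```

Under these assumptions the mobile configuration draws less power but, because of its much longer runtime, spends more total energy on the task.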


Discussion on Intel Xeon Phi Massively parallel architectures
have gained significant attention as a way to improve the
throughput and power efficiency of high-performance
computing (HPC) technology, in response to the relative
stagnation of clock frequencies. The Xeon Phi coprocessor,
recently developed by Intel, is one such effort. It integrates
more than 50 CPU cores together with L1/L2 caches, a
network-on-chip, a GDDR memory controller, and a PCIe
interface. Each core supports up to 4-thread in-order operation
and a 512-bit SIMD VPU (vector processing unit). While the
runtime and energy consumption of the coprocessor are highly
dependent on the target workloads, several recent investigations
have quantified its performance and energy efficiency. In the
high-performance configuration, a system integrating Xeon and
Xeon Phi shows a throughput of 100 tera floating-point
operations (flop) per second and a power consumption of
72.9 kW, corresponding to an energy efficiency
of 0.74 nJ/flop [51]. The classification benchmark codes (with
the largest number of SVs) require 0.02235 Gigaflop on the
desktop processor configuration similar to the Intel Core i7. To
a first-order approximation, therefore, the Xeon and Xeon
Phi-based system takes 0.2235 µs and uses 16.5 mJ per
classification. This energy consumption seems significantly lower
than that of the Intel Core i7, and very close to its lower
bound, which is approximately 3 mJ. However, one should
keep in mind that the energy is grossly underestimated, as
we not only ignored the energy needed for the RAM, but
also neglected the cost of the non-floating-point operations,
which are approximately twice as many as the floating-point
operations. For all these reasons it is difficult to compare the
energy consumption of the Xeon Phi to that of the Intel Core i7.
In any case, even for our very conservative energy consumption
estimate, the IBM chip remains significantly more energy
efficient.
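
The first-order estimate above follows directly from the quoted flop count, throughput, and energy per flop, as the sketch below shows. The final correction for non-floating-point operations is a hypothetical illustration only: it assumes each such operation costs as much energy as one flop, which the text does not state.

```python
# Figures quoted in the text.
FLOP_COUNT = 0.02235e9            # floating-point operations in the benchmark
THROUGHPUT_FLOP_PER_S = 100e12    # 100 tera flop per second
ENERGY_PER_FLOP_J = 0.74e-9       # 0.74 nJ per flop [51]

# First-order runtime and energy, ignoring RAM and non-floating-point work.
runtime_s = FLOP_COUNT / THROUGHPUT_FLOP_PER_S
energy_j = FLOP_COUNT * ENERGY_PER_FLOP_J

# Hypothetical correction (assumption, not from the paper): non-floating-point
# operations are about twice as numerous as flops and cost the same energy each.
NON_FP_OPS_PER_FLOP = 2.0
energy_with_non_fp_j = energy_j * (1.0 + NON_FP_OPS_PER_FLOP)

print(f"runtime = {runtime_s * 1e6:.4f} us")
print(f"energy (flops only) = {energy_j * 1e3:.1f} mJ")
print(f"energy (with assumed non-FP cost) = {energy_with_non_fp_j * 1e3:.1f} mJ")
```

Even this rough correction leaves out memory energy, so it should be read as an indication of how the flops-only figure understates the true cost rather than as an estimate in its own right.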
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